How Predictable Are Biological Sequences?
نویسندگان
چکیده
A major challenge facing computational biology is the post-sequencing analysis of patterns and motifs in genomic DNA sequences. In this study, we apply asymptotically optimal Sampled Pattern Matching predictor, recently developed by us, to analyze biological sequences (i.e., proteins and DNA). The SPM predictor for a given sequence X1; : : : ; Xn predicts the next symbol Xn+1 (or next K symbols) based on selecting a context of Xn+1, that is, it predicts the value of the most frequent symbol appearing at the so called sampled positions. These positions follow the occurrences of a fraction of the longest suÆx of the original sequence that has another copy inside X1X2 : : : Xn. In our previous paper [11] we estimated the redundancy of the SPM universal predictor, that is, we established that the probability the SPM predictor makes worse decisions than the optimal predictor is O(n ) for some 0 < < 1 2 as n ! 1. When SPM is applied to molecular sequences it proves its suitability to protein prediction and DNA prediction. Finally, we should add that our ultimate goal is to bring solid methods of information theory to analysis of molecular sequences. This paper is our preliminary attempt.
منابع مشابه
A computational method to analyze the similarity of biological sequences under uncertainty
In this paper, we propose a new method to analyze the difference and similarity of biological sequences, based on the fuzzy sets theory. Considering the sequence order and some chemical and structural properties, we present a computational method to cluster the biological sequences. By some examples, we show that the new method is relatively easy and we are able to compare the sequences of arbi...
متن کاملMining Biological Repetitive Sequences Using Support Vector Machines and Fuzzy SVM
Structural repetitive subsequences are most important portion of biological sequences, which play crucial roles on corresponding sequence’s fold and functionality. Biggest class of the repetitive subsequences is “Transposable Elements” which has its own sub-classes upon contexts’ structures. Many researches have been performed to criticality determine the structure and function of repetitiv...
متن کاملRelation Between RNA Sequences, Structures, and Shapes via Variation Networks
Background: RNA plays key role in many aspects of biological processes and its tertiary structure is critical for its biological function. RNA secondary structure represents various significant portions of RNA tertiary structure. Since the biological function of RNA is concluded indirectly from its primary structure, it would be important to analyze the relations between the RNA sequences and t...
متن کاملClustering of Short Read Sequences for de novo Transcriptome Assembly
Given the importance of transcriptome analysis in various biological studies and considering thevast amount of whole transcriptome sequencing data, it seems necessary to develop analgorithm to assemble transcriptome data. In this study we propose an algorithm fortranscriptome assembly in the absence of a reference genome. First, the contiguous sequencesare generated using de Bruijn graph with d...
متن کاملThe Phylogeny of Calligonum and Pteropyrum (Polygonaceae) Based on Nuclear Ribosomal DNA ITS and Chloroplast trnL-F Sequences
This study represents phylogenetic analyses of two woody polygonaceous genera Calligonum and Pteropyrum using both chloroplast fragment (trnL-F) and the nuclear ribosomal internal transcribed spacer (nrDNA ITS) sequence data. All inferred phylogenies using parsimony and Bayesian methods showed that Calligonum and Pteropyrum are both monophyletic and closely related taxa. They have no affinity w...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2003